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Abstract 

Markov state models (MSMs) have become a popular approach for investigating the confor- 
mational dynamics of proteins and other biomolecules. MSMs are typically built from numerous 
molecular dynamics simulations by dividing the sampled configurations into a large number of 
microstates based on geometric criteria. The resulting microstate model can then be coarse- 
grained into a more understandable macro state model by lumping together rapidly mixing 
microstates into larger, metastable aggregates. However, finite sampling often results in the cre- 
ation of many poorly sampled microstates. During coarse- graining, these states are mistakenly 
identified as being kinetically important because transitions to/from them appear to be slow. 
In this paper we propose a formalism based on an algebraic principle for matrix approximation, 
i.e. the Nystrom method, to deal with such poorly sampled microstates. Our scheme builds a 
hierarchy of microstates from high to low populations and progressively applies spectral cluster- 
ing on sets of microstates within each level of the hierarchy. It helps spectral clustering identify 
metastable aggregates with highly populated microstates rather than being distracted by lowly 
populated states. We demonstrate the ability of this algorithm to discover the major metastable 
states on two model systems, the alanine dipeptide and trpzip2 peptide. 
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I. INTRODUCTION 



Markov state models (MSMs) are a powerful approach to model both the thermo- 
dynamics and kinetics of proteins and other biomolecules [JQ. In an MSM, one de- 
composes the conformational space of a molecule into metastable states (i.e. long-lived 
states that correspond to basins in the free energy landscape that ultimately determines 
a system's structure and dynamics) and calculates a transition probability matrix whose 
entries specify the probability of transitioning between each pair of states in some time 
interval, called the lag time of the model. Using the transition probability matrix, one 
can then calculate the relaxation of an ensemble of proteins upon the temperature jump 
[13I . Q, find the highest flux paths between pairs of states 



other analyses a 
protein folding 17 



ISHlTj]. or perform numerous 



In recent years, MSMs have been successfully apphed to study 



211, RNA folding |9|, l22| and protein-ligand binding [23 



25|. 



To build an MSM, one typically runs numerous molecular dynamics (MD) simulations. 
A geometric criterion (e.g. RMSD) is then used to cluster the sampled conformations into 
a large number of microstates and the transition probabilities between pairs of states are 
calculated from the number of transitions observed between them ^ . One assumption is 
that conformations within each microstate are structurally similar enough to ensure fast 
kinetic transitions between them. This normally requires a large number of microstates 
(on the order of 10, 000) even for small proteins 



. Unfortunately, poor statistics re- 



sulting from finite sampling can result in a significant number of microstates where few 
transitions into/out of the state are observed, giving the false appearance that the micro 
state is separated from others by a large free energy barrier. Moreover, while such mi- 
crostate models are often ideal for making a direct connection to experiments because of 
their structural and temporal resolution, extracting human intuition from them is difficult 



because of the large number of states 



26|. 



To facilitate human understanding, one typically lumps kinetically related microstates 
together to form a smaller number of macrostates. This coarse-graining is typically 
achieved using spectral methods (e.g. Perron Cluster Cluster Analysis (PCCA) algorithm 
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0, ll^]). The basic idea behind spectral methods is that metastabihty causes microstate 
transition probabihty matrices to have nearly block diagonal structures. As a result, the 
eigenvectors corresponding to the largest (slowest) eigenvalues are nearly piecewise con- 
stant and can be used to separate metastable states. Poorly sampled microstates will tend 
to appear in the top eigenvectors as nearly constants and, as a result, the spectral methods 
will separate them into their own macrostates. In the extreme case, poor statistics can 
even result in sparsely populated states that act as sources or sinks because transitions 
both in and out of them were not observed. Such sources and sinks may destroy a model's 
ability to predict long timescale dynamics. In order to alleviate this problem, Noe et al 



171 ] removed microstates that were weakly connected to the rest, while Bowman et al. 



suggested subsampling the MD conformations to reduce the impact of outliers. However, 
both approaches require ad hoc choices on the part of the model builder. 



A number of methods have now been proposed to deal with these issues 



28 



-|3l|. For 



examp 
rithm 



e^we previously presented the super- level-set hierarchical clustering (SHC) algo- 
which was inspired by recent developments in topological model construction. 



While this approach gave promising results 



22|, this approach relied on estimates of the 



densities of different microstates, which are unreliable due to the high dimensionality of 
biomolecule's conformational spaces. 

Here, we present a hierarchical formalism based on the Nystrom extension, an algebraic 
principle in matrix approximation, to rigorously treat sparsely populated states with poor 
statistics. We show that the removal of sparsely populated microstates will lead to a stable 
approximation of the top eigenvectors of the micro state transition probability matrix that 
are associated with macrostates. Furthermore, we use the Nystrom method to obtain an 
efficient lumping of microstates including those with poor statistics. To achieve this, we 
divided the microstates into multilevel subsets from high to low populations to capture 
the multiscale nature of biomolecule's free energy landscapes. We then apply spectral 
clustering to each level set. Such a method allows us to identify the dominant metastable 
conformational states with significant populations. Finally, we demonstrate the power of 
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this new scheme on both the alanine-dipeptide and the trpzip2 peptide system. 



II. METHODS 



A. Kinetic Lumping with Spectral Methods 



To construct a Markov State Model (MSM) that precisely describes the conformational 
dynamics of a system, one typically needs to exploit both geometric and kinetic informa- 
tion from MD simulations. Geometric distances, often measured with the RMSD distance 
between configurations, can be used to identify conformations that can likely interconvert 
quickly because of their close geometric proximity. For example, fc-medoids 0, or k- 
centers {2, 32] methods have been used to find kinetically-relevant clusterings of MD data. 



This geometric clustering is often followed by kinetic lumping into macrostates, typically 
by spectral clustering jj [23|. As discussed in the Introduction, a common issue caused 
by direct application of spectral method is that one captures many spurious macrostates 
consisting of a single, poorly sampled microstate. In order to further elaborate this issue, 
we first review the principles of kinetic lumping with spectral methods. 

Kinetic lumping aims to group together microstates that can interconvert rapidly. In an 
extreme case where metastable regions are separated by infinitely high free energy barriers, 
transition probabilities between these metastable macrostates will be zero. Therefore, the 
transition probability matrix P (a row Markov matrix satisfying Pij = 1 with Pij > 0) 
contains uncoupled Markov chains. Let X be the configuration space, Y be the microstate 
space and Z be the macrostate space. If P is uncoupled with respect to the macrostate 
partition {Zi, . . . , Zm}, P can be rewritten as a block diagonal matrix after a permutation 
of the state indices {Zi, . . . , Zm}- 



Pi 

P2 


... 









P 



M 
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In this case, there are M (right) eigenvectors which are piecewise constants over Zi (i = 
1, . . . , M) and associated with the maximal eigenvalue, 1. 

In biological macromolecules, metastable regions of the free energy landscape are nor- 
mally separated by barriers greater than a kT (i.e. thermal fluctuations). In this case, we 
can view the new transition matrix P as nearly uncoupled Markov chains, i.e. a perturbed 
transition matrix P = P + E where P is uncoupled and E is a matrix with small norm, de- 
noted by e = 11-^11. For small enough e, P has M eigenvalues A, = 1 — 0(e) (i = 1, . . . , M) 
whose eigenvectors are nearly piecewise constants, Vi{y) = J^iLi'^i^Ziiv) + 0{e). Nearly 
uncoupled Markov chains are helpful to describe the conformational dynamics with a sep- 
aration of timescales, where dynamics within each block P^ are fast compared to slower 
dynamics between different blocks Pki- 

The key idea in kinetic lumping is to study the leading right eigenvectors of the row 
Markov matrix P = D^^K, where Ki j is the number of transitions from microstate 
i to j observed from simulations, and D = diag{di) where di is the total number of 
transitions starting from microstate i. The nearly piecewise constant structures of the 
right eigenvectors of P leads to the partition of metastable macrostates. In practice, the 
sign structures of the right eigenvectors are often used for the lumping to construct MSMs 
27|. 

A major issue associated with this spectral method is that it captures many spurious 
macrostates with low population (small di). Since microstates must have small RMSD 
diameters to ensure kinetic similarity within the state, a significant number of microstates 
will have low populations and be nearly disconnected from the rest of the state space due 
to insufficient sampling. These noisy states often appear as being kinetically distinct 
in the top right eigenvectors of the transition matrix. To remove such noise from the 
signal, regularization must be applied. Some heuristic approaches have been adopted in 
the literature, like leaving-out certain rarely visited states or subsampling the MD 
data |26|. However, how much data to discard remains an open question in such ad hoc 
methods. 
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In the next section, we will introduce the Nystrom Method, an algebraic principle for 
matrix approximation by sub-rows (columns) that provides a mathematical foundation 
for the heuristic methods described above and, moreover, points to a new technique for 
coarse-graining MSMs. 

B. Nystrom Method 

Let us define a transition count matrix K whose elements {K{i,j)) denote the number 
of transitions observed from state i to state j for all i,j G Y. We will expect that the 
transition probability matrix P = D^^K with D = diag{di) where di = is a 

nearly uncoupled Markov matrix. Unfortunately when there are not sufficient samples, 
there are often spurious macrostates with low population which contribute to the nearly 
uncoupled dynamics in P. In the following we will suggest a method based on the Nystrom 
method to eliminate these noisy states. 

1. Application to symmetric matrix 

Originally, the Nystrom approximation was developed to find an approximate eigen- 
decomposition of a symmetric matrix based on its submatrices. We first illustrate this idea 
by applying it to the transition count matrix, which is symmetric due to the reversibility of 
molecular dynamics. Assume that a symmetric and non-negative transition count matrix 
K is partitioned by 



Let A = UAU^ be the eigen-decomposition of A, then we can define the Nystrom 
approximation of K to be 



A B 



K 



B'^ C 




U 



A 



B 



K 



A A~^U^B 



(2) 



B^UA-^ 



B^ B^A~'B 
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Therefore \\K — K\\ = \\C — B'^A^^B\\ ^ if the entries of C and B are far smaller 
than those of A. In this case, = [f/^, A^^U'^B] provides a good approximation of the 
leading eigenvectors of K and the original K. The sign structure of U decided by A will 
be stable compared to K. 



2. Application to transition probability matrix. 



To construct MSMs, we need to deal with the non-symmetric transition probability 
matrix P. Our purpose is to find the right eigenvectors u of P, s.t. Pu = An, which will 
be used to find metastable states. Recall that P = D~^K. Therefore, Pu = D~^Ku = 
\u D^^/'^KD'^/'^v = \v where v = D^^'^u. So, to find the eigenvectors of P, it suffices 
to find the eigenvectors v of the symmetric matrix Kn = D~^^^KD~^^^, through which the 
right eigenvectors of P can be obtained by m = D^^^^v. To find a robust coarse-graining 
into metastable states, the Nystrom method can be extended to treat Kn- 

Consider Kn partitioned in the following way. 



Kn 









-1/2 














Da 







A 


B 




Da 










Db 




B^ 


C 







Db 



-1/2 



(3) 



where D = diag{di) = diag{DA, Db), and di = Kij. 



^ —1/2 

Define Da = diag{^- Aij) and eigen-decomposition D^^ AD 



-1/2 



VAV^. Then 



D^^AU = UA where U = D'^^'^V and U^DaU = I. Define £)b = diag(J2j Btj). 
Nystrom extension for Kn is 



Kn 



V 

b-'/'B^b-"'vA~^ 



A 



A-^V^b-^^'^BD-^'^ 



Bn 



Dn B^^^^Bn 



where An 



b~^^'^Ab~^^'\ Bn = b~A^'^Bb~^^'\ If we concentrate on the major submatrix 
eigen decomposition An = VAV"^ , and thus U = bj^^'^V is the right eigenvector 



of Pa = Dj^A. The Nystrom method therefore approximates the right eigenvectors of 
P = D-'K as [f/^, {Ds'B^UA-YV- 

It is clear that in the Nystrom method, Kn approximates Kn well if C, i? << A, whence 
Da ~ Da, Db ~ Db, and D'j^^'^CD^^^ — B^A'^Bn ~ 0. To ensure that the entries of 
C and B are as small as possible compared to those of A, one can either assign all the 
sparse (or rare) states to the block C or randomly down-sample the conformations such 
that rare states are not included in the block A. These two schemes correspond to the 
heuristic treatments by Noe et al. and Bowman et al. 26], respectively. 

In this paper we use the following greedy method to construct the submatrix A deter- 
ministically. We first sort the rows (columns) of K in decreasing order according to the 
row (column) sum. We then choose A to be the m-hj-m matrix consisting of the m largest 
rows and columns. In practice, one often observes that the distribution of microstate pop- 
ulations is roughly a power law distribution, which implies B collects exponentially many 
states/rows with exponentially small populations. 



3. Recovery of original Markov Chains. 

In the Nystrom approximation discussed above, the eigenvectors of the submatrix A 
have the same sign structures as the right eigenvectors of the transition probability matrix 
if the entries of A are sufficiently large compared to those of B and C. However, in order 
to recover the original Markov chains using Nystrom method, a few other conditions need 
to be satisfied. Let's consider that the original transition probability matrix P = D~^K 
can be partitioned into M macrostates: M. = {Zi, . . . , Zm}- If P is an uncoupled Markov 
chain with M piecewise constant eigenvectors associated with eigenvalue 1, the partition 
M can be exactly recovered under the following three conditions. First, at least one 
microstate in any macrostate Zj has to appear in the submatrix A, otherwise there is no 
possibility to reconstruct it. Second, any macrostate Zi must not break into disconnected 
parts in A. This condition ensures that downsampling in the Nystrom approximation 
does not generate incorrectly break one macrostate into multiple pieces. Finally, every 
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rare microstate with indices in B should be directly connected to one or more large 
microstates with indices in A. With these conditions, one can recover such macrostates 
by the Nystrom method, merely from submatrices A and B, via the following steps. 

1. Find eigen- decomposition of A^ = D~^'^ AD'^'^ = VAV^; 

1/2 

2. Construct U = D/'^V, find piecewise constant vectors in column vectors of U and 
the partition {Zi, . . . , Z^} associated with them; 

3. Assign each rare state k to the partition it has the maximal transition probability 
to, maxi P^^^ = max, J^jez, Pkj- 

These conditions can be summarized precisely in Theorem A (see Appendix). 

However, in molecular dynamics simulations of biological macromolecules, we normally 
deal with nearly uncoupled Markov Chains determined by the underlying metastable 
regions of the free energy landscape. In this case, we can introduce a small perturbation 
on the uncoupled Markov Chain discussed above: A = Aq + E with the block-diagonal Aq 
for the uncoupled Markov Chain. The partition P = {Zi, . . . , Zm} can be recovered using 
Nystrom method if the following three conditions can be satisfied. The first condition is 
identical with the uncoupled Markov Chain case and requires that some microstates of 
Zi are selected from A. In the second condition, any two microstates in Zi can always 
be connected directly or through other microstates in A regardless of the magnitude of 
the perturbation e = \\E\\. In other words, Zi does not break into disconnected parts in 
A with small enough perturbation e. The last condition requires that every microstate 
in B should be always directly connected to one or more microstates in A regardless 
of the size of e. In practice, a sufficiently long lag time may be used to obtain the 
transition probability matrix in order to satisfy this condition, which can be implemented 
by applying step 3 above to transition probabilities with long enough lag time. These 
conditions can be summarized in Theorem B (see Appendix). 

The challenge for applying the Nystrom method is the lack of prior knowledge of how 
best to split the transition matrix into submatrices A and B. Moreover, due to the 
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multiscale nature of free energy landscapes 



22| . this split will largely depend on the scale 



of the model, i.e. how coarsely we decompose the conformation space. Therefore, we 
pursue a hierarchical Nystrom approximation by ranking all the rows (columns) by their 
sums (or populations) and dividing the matrix K into multilevel blocks. 



C. Multilevel analysis on free energy landscapes 

The free energy landscapes of biomolecules are rugged, having a large number of local 
minima separated by barriers with various height. Therefore the metastable macrostates 
have a multiscale nature depending on the energy barriers and equivalently the intrinsic 
timescales. Such a multiscale landscape can be accurately represented by a cluster tree 



(Figure ^ |22| 



Given a free energy landscape, construction of such cluster trees can be performed as 
follows, (a) Divide the conformation space into overlapping level sets according to free 
energy, so that each level contains all previous levels with lower free energy, (b) For each 
level, apply spectral clustering to the largest connected component, (c) A cluster tree is 
generated by connecting nodes lying in adjacent levels that have nonempty intersection. 

An example is given in Figure [1] In the coarsest level, two metastable states (corre- 
sponding to the two nodes belonging to level 1 in Figured]) are desired with a large implied 
time scale for transitioning between them. In a less coarse model, the system may exhibit 
four metastable states (corresponding to the four nodes belonging to level 2 in Figured]), 
and two of them are crossed with relatively faster speed. To capture metastable states, 
one has to keep in mind such a multiscale nature. 

The challenge in conformation dynamics lies in that we do not know the underlying 
free-energy landscape. Even though the free energy will decide the sample distribution 
in conformation space, it is impossible to reach an accurate estimation of densities in 
such a high dimensional space. Therefore it is impossible to pursue an accurate multilevel 
analysis of such free energy landscapes. Fortunately the Nystrom method leads us to a 
reasonable multilevel approximation scheme. 
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Microstates with high populations are visited frequently and, therefore, tend to be 
highly correlated with metastable states or free energy basins. On the other hand, mi- 
crostates with low population are only visited rarely, suggesting they lie on energy barrier. 
Therefore we may apply the Nystrom method progressively from highly populated states 
to rarely visited states, which leads to a natural way to explore free energy landscapes 
according to their multiscale nature. The following algorithm works efficiently in the 
application examples shown in the Results section. 

Algorithm 1 Hierarchical Nystrom Extension Graph (HNEG) 

(1) Sort microstates by their population in decreasing order, and divide all microstates into 

m level sets: top pi,P2, ■ ■ ■ iPm populated states, where pi < P2 < • • • < Pm] 

(2) On each level, perform spectral clustering only with top pj (1 < j < m) microstates, 

(3) Draw a Hierarchical Nystrom Extension Graph: 

A. Node set: for each level, each cluster/group is represented by a node; 

B. Edge set: an edge is added if the associated two nodes are in adjacent levels with 
nonempty intersections; 

C. Gradient flow arrows: for each edge, an arrow is added pointing from high to low level; 

(4) Lumping: 

A. Set each attraction node (with zero out-degree) as metastable states; 

B. Expand those metastable states to its associated attraction basins (the intermediate 
nodes which flow down to single attraction nodes along gradient flows); 

C. Assign microstates on barrier nodes (those branching nodes which flow down to more 
than one attraction nodes) to the attraction basin with maximal transition probability. 



Unlike the case of a cluster tree built from a known free-energy landscape, the hierar- 
chical Nystrom method for nearly uncoupled Markov chains generally constructs a graph 
instead of a tree. Note that in this algorithm, the most important free parameter is how 
to divide the microstates into level sets, i.e. the choice of the pj. In this study, we have 
applied the Bayes factor approach introduced by Bacallado et. al. 33| to select the opti- 
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mal level set. A Bayes factor {B{Li, L2) = P{Li \ D)/P{L2 \ D)) is the ratio of posterior 
probabilities for two lumpings {Li and L2) given finite sampling and a common set of 
microstates (D). We first generate a large number of models by systematically varying 
the level sets, and then select the optimal level set as that can produce the model with 
the highest posterior probability. The second important free parameter is how to choose 
the number of clusters during spectral clustering within each level set. In this work, the 
number of clusters will be chosen based on the maximal spectral gap in top eigenvalues 
as in PCCA I5, 27]. 



D. Simulation Details 

Alanine Dipeptide The alanine dipeptide dataset is taken from Chodera et al. 4| 
It consists of 975 20-ps NVE simulations with conformations stored every 0.1 ps, for a 
total of 195k conformations. Please refer to 4] for additional details on the simulations. 
We split the conformations into 5000 microstates using a K-centers clustering algorithm 
[2I. The distance between a pair of conformations is determined by their backbone root 
mean square deviation (RMSD). 



Trpzip2 The trpzip2 simulation dataset is taken from Zhuang et al. [13|. It contains 
830 50-ns MD simulations with conformations saved every lOOps. This results in a total 
of 415k conformations. Please refer to 



13[ for additional details on the simulations. We 
1 to divide the conformations into 2000 clusters 



have performed K-centers clustering 
based on the heavy atom RMSD. 

III. RESULTS 



A. Alanine-dipeptide 

To validate the algorithm developed here, we first apply it to a simple alanine dipeptide 
system. For alanine dipeptide, it is easy to obtain equilibrium sampling and an accurate 
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representation of the free energy landscape by projecting it onto a pair of torsion angles: (f) 
and tp (see Fig. [2^). Therefore we can visualize the resulting states and check their various 
properties on this projection of the free energy landscape. Such projection is displayed in 
Fig [2]d, and the free energy is calculated hj W = — A;Tln(Pj/Po); where Pi is the number 
of conformations that lie in bin i divided by the total of number conformations, and Pq 
is a constant. We discretize the torsion plane by dividing it into square bins of 10° by 
10°. As shown in Fig there are six minima or metastable states centered at around: 
(-140°, 160°), (-60°, 140°), (-140°, 160°), (-60°, -50°), (50°, -100°), and (40°, 60°). 

In order to generate the Hierarchical Nystrom Extension Graph (HNEG) to describe 
the conformational dynamics of the alanine dipeptide, we first applied the k-centers clus- 
tering algorithm to divide the conformations into 5000 microstates. Clusters generated 
from the k-centers clustering algorithm have an approximately uniform distance to their 
cluster centers, and thus result in a strong correlation between the cluster population 



and its density 



32| . We then followed the procedure introduced in Section IIC 



to construct the hierarchical directed graph using the Nystrom Expansion method. One 
example of a HNEG is shown in Fig. [2t. This graph is generated using the level set 
[0.1, 0.2, 0.3, . . . , 0.8, 0.9], and arrows are used to connect intersecting nodes in the adja- 
cent levels as gradient flows. There are six attraction nodes (labeled from 1-6) each lying 
in to a different free energy basin as shown in Fig Attraction nodes 1 and 2 appear 
in the first level (top 10% population), indicating the deepest free energy basins. While 
attraction nodes 5 and 6 belong to the last level and correspond to the shallowest free 
energy basins. These results are consistent with projection of the free energy landscape 
shown in Fig. [2b. Therefore, the HNEG is able to provide a hierarchical representation of 
the free energy landscape. Moreover, starting from the 6 attraction nodes, we can further 
lump all the microstates into six metastable states following Algorithm 1 in Sec IIC. 

To compare the performance of our lumping algorithm with the popular PCCA method, 
here we apply the Bayes factor approach 33|]. As discussed in Sec IIC, the Bayes fac- 
tor compares the posterior probabilities of two lumpings constructed from a common 
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microstate set, and the larger the Bayes factor is, the more comparative advantage the 
first lumping has. For example InB = 50 implies that P{Li \ D)/P{L2 \ D) = e^^ ~ 
5.18 X 10^^, whence lumping Li is much better than L2. Since MSMs generated by our 
algorithm is not unique and depend on the construction of level sets, we have generated a 
number of different MSMs using a systematic variation of the level sets to compare with 
the PCCA method. 

The level sets are mainly determined by two factors: the population cut-off (i.e. mi- 
crostates to be included in the submatrix A in the Nystrom method) and the number of 
levels. We also noticed that the largest microstate has a population of 6%, so we select 
the 1st level to be at 10% to ensure a sufficient number of states in each level. Specifi- 
cally, we have selected 9 different population cut-offs (pm)'- 20%, 30%, . . . , 90% and 99% 
and 10 different numbers of levels (from 6 to 15). For example, the combination of 90% 
cut-off and 9 levels will generate a level set of [0.1, 0.2, 0.3, . . . , 0.8, 0.9]. Altogether, we 
have obtained 90 different level sets for MSM construction. 

The Bayes factor test shows that the models generated by our algorithm have a con- 
sistently higher posterior probability (In B > 100) than those obtained from the PCCA 
method (see Fig. |3]). Moreover, the models with six metastable states have the highest 
Bayes factors, indicating that a 6-state model is an optimal choice, which is consistent 
with the earlier study |4|. 



B. Trpzip2 



In this section, we have further applied our algorithm on a larger system: a 12-residue 
/3-hairpin trpzip2 peptide. Its foldin g h as been extensively studies by both experimen- 
tal and computational techniques In particular, a few recent studies 
have constructed MSMs from MD simulations to identify the metastable states during its 
folding process {4, 13|. 



The Bayes comparison results clearly demonstrate the advantage of our algorithm over 
the PCCA method. As with the alanine dipeptide system, we have generated a number of 
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different lumpings by varying the level sets of the 2000 microstates generated by K-centers 
clustering. Specifically, we select 13 different population cut-offs {pm. =41%, 45%, 50%, 
55%,..., 90%, 95%, 99%) and 20 different numbers of levels (m=ll,...,30), resulting in 
a total of 260 different models. As shown in Fig. HI all 260 models have substantially 
higher posterior probabilities than those obtained from the PCCA method. Furthermore, 
models with 13 metastable macrostates have the highest posterior probabilities, which 
are consistent with our previous study [l3|. Therefore, we select the model with the 
highest posterior probability as the optimal lumping. In Fig. 5, we have displayed the 
representative structures from each of the macrostates of the optimal lumping together 
with its equilibrium population obtained from the MSM. The macrostate 3 has the highest 
population (around 38%) and corresponds to a folded hairpin structure, indicating that 
the trpzip2 peptide still contains a large fraction of the hairpin component at 350K. 

Our results also show that the new algorithm efficiently captures highly populated 
metastable regions of the transition probability matrix, while the PCCA method first 
separates rarely visited sparse states. To visualize this, we have compared the nearly 
diagonal block structures of the transition probability matrices of a representative lumping 
obtained by our algorithm and PCCA (see Fig. [6ti). These results show that PCCA has 
the tendency to first separate sparsely populated states as metastable states (6 out of 
11 states have populations < 2%), while our algorithm efficiently identifies the highly 
populated metastable regions. Among the 260 models generated by our algorithm, the 
number of large states (> 2%) converges to 9 with a total of less than 20 metastable 
states (see Fig. [Bt). However, PCCA needs models with nearly 80 states in order to 
identify the same number of large metastable states. 

Our results also demonstrate that the identification of the major metastable states 
using the Nystrom method is robust to leaving out different fraction of rarely visited 
microstates. After including > 50% of the most populated microstates, our algorithm 
consistently identifies the same number of large macrostates (with population > 1%, 2% or 
3%) (see Fig. [7]). In order to further check whether our algorithm can consistently identify 
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large macrostates containing similar conformations in addition to the same populations, 
we have performed pairwise mutual information calculations between the optimal lumping 
and all other lumpings. For a pair of two lumpings (/ and g), the mutual information is 
defined as 

/(/, 9) = Y1 = ^, 9{x) = J) log pf//f^Vp1lT'^V 

^ PUi^) =VPxig{x) = 3) 

where P{f{x) = i) is the probability for microstate x belonging to macrostate i in the 
lumping /, and P{f{x) = i,g{x) = j) is the joint probability for microstate x belonging 
to both macrostate i in the lumping / and macrostate j in the lumping g. 

In Fig. [HI we have plotted the histogram of pairwise mutual information between the 
optimal lumping (with thirteen macrostates) and all other lumpings generated by our 
algorithm (259 of them). All these lumpings from our method have an average mutual 
information of 2.104. This value is close to its theoretical upper limit (the entropy of 
the optimal lumping, 3.313), and well above that of the model produced by the PCCA 
method (with thirteen macrostates and the mutual information of 0.785). For better 
comparison, we have also shown that the mutual information between random lumpings 
and the optimal lumping is only around 0.0513. These results clearly demonstrate that 
models generated by our Nystrom method with different level sets consistently have a 
larger similarity with the optimal lumping compared to the PCCA method. 

In order to further illustrate that different lumpings generated from our method contain 
a common set of large macrostates that share similar protein conformations or microstates, 
we have plotted the the joint probability P{f{x) = i,g{x) = j) between the optimal 
lumping and a representative lumping (see Equ. IHfor the detailed definition). We chose 
the representative lumping to have a mutual information close to the mean value of all the 
lumpings generated by our algorithm. Both the optimal lumping and the representative 
lumping contain 9 large macrostates (with population > 2%). The joint probability matrix 
is thus a 9 by 9 matrix (see Fig. [9]), and each matrix element represents the overlapping 
of the microstates in a pair of macrostates each belonging to a different lumping. After 
a permutation, it is clear that large macrostates in two compared lumpings contain a 

17 



large fraction of identical microstates (with large diagonal matrix elements but small 
off-diagonal elements). 



IV. CONCLUSION AND DISCUSSION 



We have developed an algorithm based on the Nystrom method and its multilevel 
extensions to construct MSMs from Molecular Dynamics simulation trajectories. Our 
algorithm is shown to be efficient in eliminating noise due to insufficient sampling by 
focusing on the dominant submatrix of the transition probability matrix. Therefore, the 
Hierarchical Nystrom Extension Graph (HNEG) method can greatly improve over other 
popular kinetic clustering algorithms such as PCCA, which tends to identify sparse states 
that are very weakly coupled to the rest of phase space before identifying real metastable 
states in denser regions of phase space. We have demonstrated that HNEG can generate 
a number of MSMs at different resolutions. 

In practice, HNEG has a similar protocol to Super-level-set Hierarchical Clustering 
(SHC) which was developed by the authors as a prototype before 221]. SHC relies on 
the rough estimation of the conformation density using the populations of microstates 
from K-centers clustering. However, estimating densities in high dimensional spaces is 
challenging, and the K-centers algorithm only generates clusters with approximately equal 
radii, so small variances in the cluster radius may induce large volume differences. On 
the other hand, the algebraic framework based on the Nystrom method does not rely 
on density estimation and leads to efficient algorithms to pursue metastable macrostates 
with a multiscale analysis. 
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Appendix 



Theorem A (Exact Recovery). Let di = Ylj -^ij degree of micro state i, and 

P = D~^K he an uncoupled Markov matrix with respect to partition P = {Zi, . . . , Zm}- 
Given 9 > 0, denote by 1$ = {i : di > 6} the microstates selected for submatrix A in 
Nystrom method. Then Nystrom method will exactly recover those Zk, if and only if the 
following holds: 



2. Any two microstates in Zj. will be connected by a path in Iq; 

3. For any microstate of Z^. outside Ig, it is directly connected to a microstate of Z^ 
in Ig. 

Proof of Theorem A Since P = D^^K is an uncoupled Markov matrix, the subma- 
trix A is block-diagonal. The first condition and the second condition ensures that the 
microstates of which are selected in A are within the same block, whence there is a 
right eigenvector of D'j^A, Uk = Iz^n/e indicator function of Z^ Hlg. Now consider 

the Nystrom approximation 



The j-th row in B lz^.nie summarizes the probability that microstate j which is 
outside Ig, to be connected in ZkHlg. If j ^ Z^, by uncoupled Markov matrix assumption, 
such a probability is 0; If j G Z^, such a probability could be 1 or depending on whether 
j is directly connected to Z^n Ig, whence by the third condition, the probability must be 
1 in this case. In summary, Uk is an indicator function on lz^. on all microstates, whence 
Zk is exactly recovered. 

Theorem B (Asymptotic Recovery under Noise). Let P = D~^K = Pq + E (e = 
\\E\\) be a nearly uncoupled Markov matrix with respect to partition Ai = {Zi, . . . , Z]\j}. 



1 



ZkDlg^ 0; 
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Given 9 > let 1$ = {i : di > 6} . Then Nystrom method with suhmatrix selected from Iq 
rows and columns will recover Zk as e ^ 0, if and only if the following holds: 

1. ZkHle^fH. 

2. Any two microstates in Zk will be connected by a 'highway' in Ig, in the sense 
that for any i,j G Zk fl Ig, there is a path {i = ko, ki, . . . ,kt = j) in Ig such that 
lim^^Q Kkt_j^^kt > 7 > for some constant 7; 

3. Any microstate of Zk outside Ig is directly connected to some microstates of Zk 
'strongly' in Ig, in the sense that for any i ^ Zk (k = 1, . . . , M), there is j & ZkCi Ig 
such that lim^^o-^jj > 7 > for some constant 7. 

Proof of Theorem B The proof of this theorem follows the same reasoning above. 
Notice that when P = D^^K = Pq + E [e = be a nearly uncoupled Markov 

matrix with respect to partition M. = {Zi, . . . , Zm}, one has Uk = Iz^n/e + ^(^) 
D^^B^UkXk^ = Dg^B^lz^ + 0(e). So in the limit e 0, the three conditions recover 
the three conditions in noise- free case. 
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Figures 



Level 4: 
Level 3: 

Level 2: 

Level 1: 




FIG. 1. Schematic figure for tlie multilevel analysis of a free energy landscape, (a) a 
1-D free energy landscape divided into four levels; (b) a cluster tree representing this 
free energy landscape. At level one, two nodes are formed that correspond to the two 
deepest free energy minima. At level two, four nodes are identified for four free energy 
minima. At level three and four, the number of nodes are reduced since some free 
energy minima are connected. 
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FIG. 2. An illustration of Hierarchical Nystrom Extension Graph (HNEG) applied to 
the alanine dipeptide system, (a) A conformation of the alanine dipeptide with two 
torsion angels (0 and ip) labeled, (b) The free energy landscape projected onto the (j) — ip 
plane, where the red color indicates regions of high density or low free energy, (c) A 
Hierarchical Nystrom Extension Graph containing 9 levels constructed for this system. 
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FIG. 3. Bayesian comparison of MSMs constructed by the Hierarchical Nystrom 
Extension Graph (HNEG) and the PCCA method for the alanine dipeptide. The y-axis 
displays the logarithm of the posterior probability (ln(P(Li | D))) for models generated 
by HNEG (red) and PCCA(black). The logarithmic Bayes factor 
\iiB = ln(P(Li I D))hneg - ln(P(L2 | D)pcca) > 100, indicating that HNEG 
consistently provides a better lumping than PCCA. 
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FIG. 4. Bayesian comparison of MSMs constructed by the Hierarchical Nystrom 
Extension Graph (HNEG) and the PCCA method for the trpzip2 peptide. The y-axis 
displays the logarithm of the posterior probability (ln(P(Li \ D))) for models generated 
by HNEG (red) and PCCA(black). The logarithmic Bayes factor 
\nB = ln(P(Li | D))hneg - ln(P(L2 | D)pcca) e [250,500], indicating that HNEG 
consistently provides a better lumping than PCCA for the trpzip2 system. 
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0(7.4%) 1(11.5%) 2(1.2%) 3(38.5%) 4(7.0%) 
5(7.1%) 6(4.8%) 7(3.7%) 8(6.8%) 

9(7.7%) 10(1.8%) 11(1.0%) 12(1.5%) 

FIG. 5. Representative structures of the 13 macrostates from the optimal lumping 
(with the highest posterior probability) for the trpzip2 system. Their equilibrium 
populations are also displayed. Macrostate 3 corresponds to a folded hairpin structure 
and has the largest population (38.5%), indicating that the trpzip2 peptide still has a 
significant fraction of the folded structure at 350K. 
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FIG. 6. (a) An illustration of block diagonal structures of the microstate transition 
probability matrices. There are 2000 microstates in total, and the matrices are 
permuted to group microstates that belong to the same macrostate together. The 
results show that PCCA (left panel) tends to separate nearly disconnected small blocks 
first, while HNEG (right panel) focuses on identifying the well populated macrostates. 
The number of macrostates for both lumpings is 11. (b) HNEG successfully identify the 
large macrostates (around 9) with a much smaller total number of macrostates (< 20) 
compared to PCCA (> 80). 
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FIG. 7. The Hierarchical Nystrom method can robustly identify the large metastable 
macrostates with population greater than l%(red), 2%(green) and 3%(blue), when 
varying the fraction of data that are included in the submatrix A (see Sec II for details) 
The percentage of data we include (P^) in the submatrix A is varied from 41% to 99%. 
The number of large macrostates keeps the same after we include 50% of the data or 
more. 
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FIG. 8. Histogram of pairwise mutual information (with a mean value of 2.104) 
between the optimal lumping and all other lumpings (259 of them) obtained from the 
Nystrom method by varying the level sets (red). For comparison, the mutual information 
between a lumping generated by PCCA (with 13 macrostates) and the optimal lumping 
is only 0.785 (green). The entropy of the optimal lumping (upper limit of the mutual 
information) is 3.313 (black), while the averaged mutual information between random 
lumpings (we produced 259 random lumpings) and the optimal lumping is 0.051. 
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FIG. 9. Illustration of the overlapping of microstates (or protein conformations) 
assigned to large macrostates in different lumpings. The optimal Lumping (A) is 
compared with a representative lumping (B) with mutual information at around 2.1. 
Both of these lumpings contain 9 large macrostates states with population > 2%. The 
joint probability matrix PA,B{hj) indicates the overlapping of microstates assigned to 
macrostate i in the optimal lumping A and macrostate j in the representative lumping 
B. Pa,b has large diagonal elements but small off-diagonal elements after permutation. 
These results indicate that the large macrostates in the two lumpings share a relatively 
large fraction of identical microstates. 
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